Use of predefined block pointers to reduce duplicate storage of certain data in a storage subsystem of a storage server

ABSTRACT

A method and system for eliminating the redundant allocation and deallocation of special data on disk by providing an innovative technique for specially allocating special data of a storage system. Specially allocated data is data that is pre-allocated on disk and stored in memory of the storage system. “Special data” may include any pre-decided data, one or more portions of data that exceed a pre-defined sharing threshold, and/or one or more portions of data that have been identified by a user as special. For example, in some embodiments, a zero-filled data block is specially allocated by a storage system. As another example, in some embodiments, a data block whose contents correspond to a particular type of document header is specially allocated.

PRIORITY CLAIM

This application is a continuation of U.S. patent application Ser. No. 12/394,002, entitled “USE OF PREDEFINED BLOCK POINTERS TO REDUCE DUPLICATE STORAGE OF CERTAIN DATA IN A STORAGE SUBSYSTEM OF A STORAGE SERVER”, which was filed on Feb. 26, 2009, and which is incorporated by reference herein in its entirety.

FIELD OF THE INVENTION

At least one embodiment of the present invention pertains to storage systems, and more particularly, to a method and system for specially allocating data within a storage system when the data matches data previously designated as special data.

BACKGROUND

In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files or other data containers share common data or where a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space by storing the identical data in a plurality of different locations served by a storage system.

A technique, commonly referred to as “deduplication,” that has been used to address this problem involves detecting duplicate data blocks by computing a hash value (fingerprint) of each new data block that is stored on disk, and then comparing the new fingerprint to fingerprints of previously stored blocks. When the fingerprint is identical to that of a previously stored block, the deduplication process determines that there is a high degree of probability that the new block is identical to the previously stored block. The deduplication process then compares the contents of the data blocks with identical fingerprints to verify that they are, in fact, identical. In such a case, the block pointer to the recently stored duplicate data block is replaced with a pointer to the previously stored data block and the duplicate data block is deallocated, thereby reducing storage resource consumption.
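By way of illustration only, the fingerprint-then-verify deduplication described above might be sketched in Python as follows. The function name `deduplicate`, the use of SHA-256 as the fingerprint, and the dictionary standing in for the disk are assumptions made for this sketch and are not taken from any particular deduplication product.

```python
import hashlib

def deduplicate(blocks: dict[int, bytes], pointers: list[int]) -> list[int]:
    """Post-process a file's block pointers, freeing verified duplicates.

    `blocks` maps a block number to its stored contents (a stand-in for disk).
    Returns the new pointer list; duplicate blocks are deleted from `blocks`.
    """
    first_seen: dict[bytes, int] = {}  # fingerprint -> first block stored with it
    result: list[int] = []
    for bn in pointers:
        data = blocks[bn]
        fingerprint = hashlib.sha256(data).digest()
        keep = first_seen.get(fingerprint)
        # A matching fingerprint only suggests a duplicate; compare the
        # contents to confirm before redirecting the pointer.
        if keep is not None and keep != bn and blocks[keep] == data:
            result.append(keep)
        else:
            first_seen.setdefault(fingerprint, bn)
            result.append(bn)
    # Deallocate every block that is no longer referenced.
    for bn in set(pointers) - set(result):
        del blocks[bn]
    return result
```

Note that each duplicate block in this sketch is stored first and freed afterward, which is the allocate-then-deallocate pattern discussed in the remainder of this background.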

Deduplication processes assume that all data blocks have a similar probability of being shared. However, this assumption does not hold true in certain applications. For example, this assumption does not often hold true in virtualization environments, where a single physical storage server is partitioned into multiple virtual machines. Typically, when a user creates an instance of a virtual machine, the user is given the option to specify the size of a virtual disk that is associated with the virtual machine. Upon creation, the virtual disk image file is initialized with all zeros. When the host system includes a deduplication process, such as the technique described above, the zero-filled blocks of the virtual disk image file may be “fingerprinted” and identified as duplicate blocks. The duplicate blocks are then deallocated and replaced with a block pointer to a single instance of the block on disk. As a result, the virtual disk image file consumes less space on the host disk.

However, there are disadvantages associated with a single instance of a block on disk being shared by a number of deallocated blocks. One disadvantage is that “hot spots” may occur on the host disk as a result of the file system frequently accessing the single instance of the data. This may occur with high frequency due to the fact that the majority of the free space on the virtual disk references the single zero-filled block. To reduce hot spots, some deduplication processes include a provision for predefining a maximum number of shared block references (e.g., 255). When such a provision is implemented, the first 255 duplicate blocks reference a first instance of the shared block, the second 255 duplicate blocks reference a second instance, and so on.
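The arithmetic of such a reference limit can be illustrated with a short Python sketch. The names `instance_for` and `instances_needed` are hypothetical and serve only to show how duplicates would be spread across instances under the example limit of 255.

```python
MAX_SHARED_REFS = 255  # example limit given in the text

def instance_for(duplicate_index: int, max_refs: int = MAX_SHARED_REFS) -> int:
    """Return which on-disk instance the Nth duplicate (0-based) references."""
    return duplicate_index // max_refs

def instances_needed(duplicate_count: int, max_refs: int = MAX_SHARED_REFS) -> int:
    """Return how many on-disk instances are required (ceiling division)."""
    return -(-duplicate_count // max_refs)
```

Under these assumptions, 1,000 duplicate zero-filled blocks would still require four instances on disk, so the limit spreads the access load without removing the redundant copies.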

Another disadvantage of deduplication is disk fragmentation. Disk fragmentation may occur as a consequence of the duplicate blocks being first allocated and then later deallocated by the deduplication process. Moreover, the redundant allocation and deallocation of duplicate blocks further results in unnecessary processing time and bookkeeping overhead.

BRIEF DESCRIPTION OF THE DRAWINGS

One or more embodiments of the facility are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 is a data flow diagram of various components or services that are part of a storage network, in one embodiment.

FIG. 2 is a high-level block diagram of a storage system, in one embodiment.

FIG. 3 is a high-level block diagram showing an example of a storage operating system, in one embodiment.

FIG. 4 illustrates functional elements of a special allocation layer, in one embodiment.

FIG. 5 illustrates an aggregate, in one embodiment.

FIG. 6 is a high-level block diagram of a container file for a flexible volume, in one embodiment.

FIG. 7 is a high-level block diagram of a file within a container file, in one embodiment.

FIG. 8 is a flow chart of a process for specially allocating data blocks prior to the data blocks being written to disk, in one embodiment.

FIG. 9 is a flow chart of a process for servicing a read request, in one embodiment.

DETAILED DESCRIPTION

The technology introduced herein eliminates the redundant allocation and deallocation of “special data” on disk by providing a technique for specially allocating data within a storage system. As used herein, “special data” is one or more pieces of data that have been designated as “special” within a host file system, such as one or more pieces of data that exceed a pre-defined sharing threshold. For example, in some embodiments, a zero-filled data block is specially allocated by a storage system. As another example, in some embodiments, a data block whose contents correspond to a particular type of document header is specially allocated. It is noted that the technology introduced herein can be applied to specially allocate any type of data. As such, references to particular data, such as zero-filled data, should not be taken as restrictive. It is further noted that the term “disk” is used herein to refer to any computer-readable storage medium including volatile, nonvolatile, removable, and non-removable media, or any combination of such media devices that are capable of storing information such as computer-readable instructions, data structures, program modules, or other data. It is also noted that the term “disk” may refer to physical or virtualized computer-readable storage media.

As described herein, a specially allocated data block is a block of data that is pre-allocated on disk or another non-volatile mass storage device of a storage system. It is noted that the term “pre-allocated” is used herein to indicate that a specially allocated data block is stored on disk prior to a request to write the special data to disk, prior to operation of the storage system, and/or set via a configuration parameter prior to, or during, operation of the storage system. That is, the storage system may be pre-configured to include one or more blocks of special data. In some embodiments, a single instance of the special data is pre-allocated on disk. When a request to write data matching the special data is received, the storage system does not write the received data to disk. Instead, the received data is assigned a special pointer that identifies the location on disk at which the special data was pre-allocated. In some embodiments, the storage manager maintains a data structure or other mapping of specially allocated data blocks and their corresponding special pointers. When a request to read specially allocated data is received, the storage system recognizes the special pointer as corresponding to specially allocated data and, instead of issuing a request to read the data from disk, the storage system reads the data from memory (e.g., RAM).

By accessing special data in memory (e.g., RAM) of the storage system, requests for such data can be responded to substantially faster because the special data may be read without accessing the disk. Moreover, by not issuing write requests to store special data on disk, the technology introduced herein avoids disk fragmentation caused by freeing duplicate data blocks. Also, by not issuing write requests to store special data on disk, the technology introduced herein substantially reduces the processing time and overhead associated with deduplicating duplicate data blocks. In addition, by not issuing requests to access special data exceeding a pre-defined sharing threshold on disk, the technology introduced herein eliminates hot spots associated with reading a single instance of a deduplicated data block.

The technology introduced herein can be implemented in accordance with a variety of storage architectures including, but not limited to, a network attached storage (NAS) configuration, a storage area network (SAN) configuration, a multi-protocol storage system, or a disk assembly directly attached to a client or host computer (referred to as a direct attached storage (DAS)), for example. The storage system may include one or more storage devices, and information stored on the storage devices may include structured, semi-structured, and unstructured data. The storage system includes a storage operating system that implements a storage manager, such as a file system, which provides a structuring of data and metadata that enables reading/writing of data on the storage devices of the storage system. It is noted that the term “file system” as used herein does not imply that the data must be in the form of “files” per se.

Each file maintained by the storage system is represented by a tree structure of data and metadata, the root of which is an inode. Each file has an inode within an inode file (or container, in embodiments in which the storage system supports flexible volumes), and each file is represented by one or more indirect blocks. The inode of a file is a metadata container that includes various items of information about that file, including the file size, ownership, last modified time/date, and the location of each indirect block of the file. Each indirect block includes a number of entries. Each entry in an indirect block contains a volume block number (VBN) (or physical volume block number (PVBN)/virtual volume block number (VVBN) pair, in embodiments in which the storage system supports flexible volumes), and each entry can be located using a file block number (FBN) given in a data access request. The FBNs are index values which represent sequentially all of the blocks that make up the data represented by an indirect block. An FBN represents the logical position of the block within a file. Each VBN is a pointer to the physical location at which the corresponding FBN is stored on disk. In embodiments in which the storage system supports flexible volumes, a VVBN identifies an FBN location within the file and the file system uses the indirect blocks of the container file to translate the FBN into a PVBN location within a physical volume.
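As a simplified illustration of the FBN-to-VBN lookup described above, the following Python sketch models an inode with a single level of indirect blocks. The class names, the `entries_per_indirect` parameter, and its default value are assumptions made for this sketch; flexible volumes and PVBN/VVBN pairs are omitted.

```python
from dataclasses import dataclass

@dataclass
class IndirectBlock:
    vbns: list[int]  # entry i holds the VBN for the i-th FBN this block covers

@dataclass
class Inode:
    size: int                             # file size in bytes
    owner: str
    indirect_blocks: list[IndirectBlock]  # locations of the indirect blocks
    entries_per_indirect: int = 1024      # assumed fan-out, for illustration

    def lookup_vbn(self, fbn: int) -> int:
        """Translate a file block number into a volume block number."""
        index, offset = divmod(fbn, self.entries_per_indirect)
        return self.indirect_blocks[index].vbns[offset]
```

For example, with four entries per indirect block, FBN 5 falls in the second indirect block at offset 1.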

When a write request (e.g., block access or file access) is received by the storage system, the data is saved temporarily as a number of fixed size blocks in a buffer cache (e.g., RAM). At some later point, the data blocks are written to disk or other non-volatile mass storage device, for example, during an event called a “consistency point.” In some embodiments, prior to the data blocks being written to disk, the technology introduced herein examines the contents of each queued data block to determine whether the contents of the data block correspond to special data. When the contents of a data block correspond to data that has been previously identified as special data (e.g., a zero-filled data block), a special VBN pointer (or VVBN/PVBN pair) is assigned to the corresponding indirect block of the file to signify that the data is specially allocated. That is, in response to a write request, the storage system determines whether the file (or block) includes specially allocated data and, if so, the storage system assigns a corresponding special pointer value to the indirect blocks of the file to identify the locations of the data that have been pre-allocated on disk. For example, if a data block corresponds to a special, zero-filled block, the corresponding VBN pointer in the level 1 indirect block may be assigned a special VBN (e.g., VBN=0), thereby signifying that the zero-filled data block is specially allocated on disk. As introduced herein, data blocks containing special data are removed from the buffer cache so that they are not written to disk. By not issuing write requests to store special data on disk, the technology introduced herein avoids disk fragmentation caused by freeing duplicate data blocks. Also, by not issuing write requests to store special data on disk, the technology introduced herein substantially reduces the processing time and overhead associated with deduplicating duplicate data blocks.
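The write path described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the 4 KB block size, the `SPECIAL` lookup table, the dictionaries standing in for the buffer cache, level 1 indirect block, and disk, and the simple `next_vbn` allocator are all hypothetical. Only the special value VBN=0 for a zero-filled block comes from the example in the text.

```python
BLOCK_SIZE = 4096                 # assumed block size
VBN_ZERO = 0                      # special VBN from the example above
SPECIAL = {bytes(BLOCK_SIZE): VBN_ZERO}  # special contents -> special VBN

def consistency_point(buffer_cache: dict[int, bytes],
                      level1: dict[int, int],
                      disk: dict[int, bytes],
                      next_vbn: int) -> int:
    """Flush queued blocks (keyed by FBN), skipping disk writes for special data.

    Returns the next free VBN after the flush.
    """
    for fbn in sorted(buffer_cache):
        data = buffer_cache.pop(fbn)       # block leaves the buffer cache
        special_vbn = SPECIAL.get(data)
        if special_vbn is not None:        # VBN_ZERO is 0, so test against None
            level1[fbn] = special_vbn      # pointer only; nothing is written
        else:
            disk[next_vbn] = data          # ordinary block: allocate and write
            level1[fbn] = next_vbn
            next_vbn += 1
    return next_vbn
```

In this sketch a file consisting mostly of zero-filled blocks consumes disk blocks only for its non-zero contents, and no block is allocated merely to be freed later.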

In some embodiments described herein, one or more VBN pointers are defined, each signifying that a data block referenced by the pointer contains special data that is specially allocated on disk. For example, the storage system may define a special VBN pointer labeled VBN_ZERO, which signifies that any data block referenced by the pointer is zero-filled and that the zero-filled data is pre-allocated on disk. As another example, the storage system may define a special VBN pointer (or PVBN/VVBN pair) labeled VBN_HEADER, which signifies any block whose contents correspond to a particular document header type and that the contents of the particular document header are pre-allocated on disk. When a read request is received by the storage system, the storage system determines whether the corresponding block pointer to the file (or block) matches a special pointer (e.g., VBN_ZERO) and, if so, the storage system reads the specially allocated data block from memory (rather than retrieving the data from disk). By accessing special data in memory (e.g., RAM) of the storage system, requests for such data can be responded to substantially faster because the special data may be read without accessing the disk. In addition, by not issuing requests to access special data exceeding a pre-defined sharing threshold on disk, the technology introduced herein eliminates hot spots associated with reading a single instance of a deduplicated data block.

Before considering the technology introduced herein in greater detail, it is useful to consider an environment in which the technology can be implemented. FIG. 1 is a data flow diagram that illustrates various components or services that are part of a storage network. A storage server 100 is connected to a non-volatile storage subsystem 110 which includes multiple mass storage devices 120, and to a number of clients 130 through a network 140, such as the Internet or a local area network (LAN). The storage server 100 may be a file server used in a NAS mode, a block-based server such as used in a storage area network (SAN), or a server that can operate in both NAS and SAN modes. The storage server 100 provides storage services relating to the organization of information on storage devices (e.g., disks) 120 of the storage subsystem 110.

The clients 130 may be, for example, a personal computer (PC), workstation, server, etc. A client 130 may request the services of the storage server 100, and the system may return the results of the services requested by the client 130 by exchanging packets of information over the network 140. The client 130 may issue a request using a file-based access protocol, such as the Common Internet File System (CIFS) protocol or Network File System (NFS) protocol, over TCP/IP when accessing information in the form of files and directories. Alternatively, the client may issue a request using a block-based access protocol, such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over Fibre Channel (FCP), when accessing information in the form of blocks.

The storage subsystem 110 is managed by the storage server 100. The storage server 100 receives and responds to various transaction requests (e.g., read, write, etc.) from the clients 130 directed to data stored or to be stored in the storage subsystem 110. The mass storage devices 120 in the storage subsystem 110 may be, for example, magnetic disks, optical disks such as CD-ROM or DVD based storage, magneto-optical (MO) storage, or any other type of non-volatile storage devices suitable for storing large quantities of data. Such data storage on the storage subsystem may be implemented as one or more storage volumes that comprise a collection of physical storage devices (e.g., disks) 120 cooperating to define an overall logical arrangement of volume block number (“VBN”) space on the volumes. Each logical volume is generally, although not necessarily, associated with a single file system. The storage devices 120 within a volume may be organized as one or more groups, and each group can be organized as a Redundant Array of Inexpensive Disks (RAID), in which case the storage server 100 accesses the storage subsystem 110 using one or more well-known RAID protocols. However, other implementations and/or protocols may be used to organize the storage devices 120 of storage subsystem 110.

In some embodiments, the technology introduced herein is implemented in the storage server 100 or in other devices. For example, the technology can be adapted for use in other types of storage systems that provide clients with access to stored data or processing systems other than storage servers. While various embodiments are described in terms of the environment described above, those skilled in the art will appreciate that the technology may be implemented in a variety of other environments including a single, monolithic computer system, as well as various other combinations of computer systems or similar devices connected in various ways. For example, in some embodiments, the storage server 100 has a distributed architecture, even though it is not illustrated as such in FIG. 1.

FIG. 2 is a high-level block diagram showing an example architecture of the storage server 100. Certain well-known structures and functions which are not germane to this description have not been shown or described. The storage server 100 includes one or more processors 200, a memory 205, a non-volatile random access memory (NVRAM) 210, one or more internal mass storage devices 215, a storage adapter 220, and a network adapter 225 coupled to an interconnect system 230. The interconnect system 230 shown in FIG. 2 is an abstraction that represents any one or more separate physical buses and/or point-to-point connections, connected by appropriate bridges, adapters and/or controllers. The interconnect system 230, therefore, may include, for example, a system bus, a form of Peripheral Component Interconnect (PCI) family bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus (sometimes referred to as “Firewire”).

The processors 200 are the central processing units (CPUs) of the storage server 100 and, thus, control its overall operation. In some embodiments, the processors 200 accomplish this by executing software stored in memory 205. A processor 200 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs), programmable controllers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or the like, or a combination of such devices.

Memory 205 includes the main memory of the storage server 100. Memory 205 represents any form of random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such devices. Memory 205 stores (among other things) the storage operating system 235. The storage operating system 235 implements a storage manager, such as a file system manager, to logically organize the information as a hierarchical structure of directories, files, and special types of files called virtual disks on the disks. A portion of the memory 205 is organized as a buffer cache 240 for temporarily storing data associated with requests issued by clients 130 that are, during the course of a consistency point, flushed (written) to disk or another non-volatile storage device. The buffer cache includes a plurality of storage locations or buffers organized as a buffer tree structure. A buffer tree structure is an internal representation of loaded blocks of data for, e.g., a file or virtual disk (vdisk) in the buffer cache 240 and maintained by the storage operating system 235. In some embodiments, a portion of the memory 205 is organized as a “specially allocated data” (SAD) data structure 245 for storing single instances of data that are to be, or have been, specially allocated by the storage server 100.

The non-volatile RAM (NVRAM) 210 is used to store changes to the file system between consistency points. Such changes may be stored in a non-volatile log (NVLOG) 250 that is used in the event of a failure to recover data that would otherwise be lost. In the event of a failure, the NVLOG is used to reconstruct the current state of stored data just prior to the failure. In some embodiments, the NVLOG 250 includes a separate entry for each write request received from a client 130 since the last consistency point. In some embodiments, the NVLOG 250 includes a log header followed by a number of entries, each entry representing a separate write request from a client 130. Each request may include an entry header followed by a data field containing the data associated with the request (if any), e.g., the data to be written to the storage subsystem 110. The log header may include an entry count, a CP (consistency point) count, and other metadata. The entry count indicates the number of entries currently in the NVLOG 250. The CP count identifies the last consistency point to be completed. After each consistency point is completed, the NVLOG 250 is cleared and started anew. The size of the NVRAM is variable. However, it is typically sized sufficiently to log a certain time-based chunk of requests from clients 130 (for example, several seconds worth).
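One possible in-memory model of the log layout described above is sketched below in Python. The class and field names are hypothetical, the entry header is reduced to a plain dictionary, and incrementing the CP count when a consistency point completes is an assumption of this sketch rather than a detail stated in the text.

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    header: dict       # per-request metadata (contents assumed)
    data: bytes = b""  # data associated with the request, if any

@dataclass
class NVLog:
    cp_count: int = 0  # identifies the last completed consistency point
    entries: list[LogEntry] = field(default_factory=list)

    @property
    def entry_count(self) -> int:
        """Number of entries currently in the log."""
        return len(self.entries)

    def log_write(self, header: dict, data: bytes = b"") -> None:
        """Append one entry for a write request received from a client."""
        self.entries.append(LogEntry(header, data))

    def complete_consistency_point(self) -> None:
        """Record the completed consistency point and start the log anew."""
        self.cp_count += 1
        self.entries.clear()
```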

Also connected to the processors 200 through the interconnect system 230 are one or more internal mass storage devices 215, a storage adapter 220 and a network adapter 225. Internal mass storage devices 215 may be or include any computer-readable storage medium for storing data, such as one or more disks. As used herein, the term “disk” refers to any computer-readable storage medium including volatile (e.g., RAM), nonvolatile (e.g., ROM, Flash, etc.), removable, and non-removable media, or any combination of such media devices that are capable of storing information such as computer-readable instructions, data structures, program modules, or other data. It is further noted that the term “disk” may refer to physical or virtualized computer-readable storage media. The storage adapter 220 allows the storage server 100 to access the storage subsystem 110 and may be, for example, a Fibre Channel adapter or a SCSI adapter. The network adapter 225 provides the storage server 100 with the ability to communicate with remote devices, such as the clients 130, over a network and may be, for example, an Ethernet adapter, a Fibre Channel adapter, or the like.

FIG. 3 shows an example of the architecture of the storage operating system 235 of the storage server 100. As shown, the storage operating system 235 includes several software modules or “layers.” These layers include a storage manager 300. The storage manager layer 300 is application-layer software that services data access requests from clients 130 and imposes a structure (e.g., hierarchy) on the data stored in the storage subsystem 110 and storage devices 215.

In some embodiments, storage manager 300 implements a write in-place file system algorithm, while in other embodiments the storage manager 300 implements a write-anywhere file system. In a write in-place file system, the locations of the data structures, such as inodes and data blocks, on disk are typically fixed and changes to such data structures are made “in-place.” In a write-anywhere file system, when a block of data is modified, the data block is stored (written) to a new location on disk to optimize write performance (sometimes referred to as “copy-on-write”). A particular example of a write-anywhere file system is the Write Anywhere File Layout (WAFL®) file system available from NetApp, Inc. of Sunnyvale, Calif. The WAFL® file system is implemented within a microkernel as part of the overall protocol stack of a storage server and associated storage devices, such as disks. This microkernel is supplied as part of Network Appliance's Data ONTAP® software. It is noted that the technology introduced herein does not depend on the file system algorithm implemented by the storage manager 300.

Logically “under” the storage manager layer 300, the storage operating system 235 also includes a multi-protocol layer 305 and an associated media access layer 310, to allow the storage server 100 to communicate over the network 140 (e.g., with clients 130). The multi-protocol layer 305 implements various higher-level network protocols, such as Network File System (NFS), Common Internet File System (CIFS), Direct Access File System (DAFS), Hypertext Transfer Protocol (HTTP) and/or Transmission Control Protocol/Internet Protocol (TCP/IP). The media access layer 310 includes one or more drivers which implement one or more lower-level protocols to communicate over the network, such as Ethernet, Fibre Channel or Internet small computer system interface (iSCSI).

Also logically “under” the storage manager layer 300, the storage operating system 235 includes a storage access layer 315 and an associated storage driver layer 320, to allow the storage server 100 to communicate with the storage subsystem 110. The storage access layer 315 implements a higher-level disk storage protocol, such as RAID, while the storage driver layer 320 implements a lower-level storage device access protocol, such as Fibre Channel Protocol (FCP) or small computer system interface (SCSI). Also shown in FIG. 3 is a path 325 of data flow, through the storage operating system 235, associated with a request.

In some embodiments, the storage operating system 235 includes a special allocation layer 330 logically “above” the storage manager 300. The special allocation layer 330 is an application layer that examines the contents of data blocks included in the NVLOG 250 to determine whether the contents correspond to specially allocated data. For example, specially allocated data may include zero-filled data blocks, data blocks whose contents correspond to a particular document header type, data blocks exceeding a pre-defined sharing threshold, or other designated data. In yet another embodiment, the special allocation layer 330 is included in the storage manager 300. Note, however, that the special allocation layer 330 does not have to be implemented by the storage server 100. For example, in some embodiments, the special allocation layer 330 is implemented in a separate system to which the NVLOG 250, buffer cache 240, or data blocks are provided as input.

In operation, a write request issued by a client 130 is forwarded over the network 140 and onto the storage server 100. A network driver (of layer 310) processes the write request and, if appropriate, passes the request on to the multi-protocol layer 305 for additional processing (e.g., translation to an internal protocol) prior to forwarding to the storage manager 300. The write request is then temporarily stored (queued) by the storage manager 300 in the NVLOG 250 of the NVRAM 210 and temporarily stored in the buffer cache 240. In some embodiments, the special allocation layer 330 examines the contents of queued write requests (data blocks) to determine whether the contents correspond to specially allocated data. When the contents of a data block correspond to data that has been previously identified as special data, a special VBN (or VVBN/PVBN pair) is assigned to the corresponding level 1 block pointer in the buffer tree that contains the block and the data block is removed from the buffer cache 240 so that it is not flushed to disk. For example, a special VBN labeled VBN_ZERO may be assigned to a level 1 block pointer to signify that the data for the corresponding level 0 block is zero-filled and has been pre-allocated on disk. By not issuing write requests to store special data on disk, the technology introduced herein avoids disk fragmentation caused by freeing duplicate data blocks. Also, by not issuing write requests to store special data on disk, the technology introduced herein substantially reduces the processing time and overhead associated with deduplicating duplicate data blocks.

Subsequently, if a read request is received, the storage manager 300 indexes into the inode of the file using a file block number (FBN) given in the request to access an appropriate entry and retrieve a volume block number (VBN). If a retrieved VBN corresponds to a special VBN (e.g., VBN_ZERO), the storage manager 300 reads the specially allocated data from memory (e.g., from the SAD data structure 245). Otherwise, the storage manager 300 generates operations to read the requested data from disk 120, unless the data is present in the buffer cache 240. If the data is in the buffer cache 240, the storage manager reads the data from buffer cache 240. Otherwise, if the data is not in the buffer cache 240, the storage manager 300 passes a message structure including the VBN to the storage access layer 315 to map the VBN to a disk identifier and disk block number (disk, DBN), which are then sent to an appropriate driver (e.g., SCSI) of the storage driver layer 320. The storage driver accesses the DBN from the specified disk 120 and loads the requested data blocks into buffer cache 240 for processing by the storage manager 300. By accessing special data in memory 205 of the storage server 100, requests for such data can be responded to substantially faster because the special data may be read without accessing the disk and/or the storage subsystem 110. In addition, by not issuing requests to access special data exceeding a pre-defined sharing threshold on disk, the technology introduced herein eliminates hot spots associated with reading a single instance of a deduplicated data block.
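The order of checks in the read path above can be illustrated with the following Python sketch. It is a simplification under stated assumptions: the level 1 pointers, the SAD data structure, the buffer cache, and the disk are modeled as dictionaries, the buffer cache is keyed by VBN, and the mapping of a VBN to a (disk, DBN) pair by the storage access layer is collapsed into a single dictionary lookup. The function name `read_block` and the returned source label are hypothetical.

```python
def read_block(fbn: int,
               level1: dict[int, int],
               sad: dict[int, bytes],
               buffer_cache: dict[int, bytes],
               disk: dict[int, bytes]) -> tuple[bytes, str]:
    """Return the block contents and where they were served from."""
    vbn = level1[fbn]                  # index by FBN to retrieve the VBN
    if vbn in sad:                     # special VBN such as VBN_ZERO
        return sad[vbn], "memory"      # served from the SAD structure, no disk I/O
    if vbn in buffer_cache:            # already loaded by an earlier request
        return buffer_cache[vbn], "cache"
    data = disk[vbn]                   # stands in for the storage access and driver layers
    buffer_cache[vbn] = data           # loaded block is kept in the buffer cache
    return data, "disk"
```

Because the special-pointer test comes first in this sketch, reads of specially allocated blocks never reach the storage subsystem, however many files reference them.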

FIG. 4 illustrates the relevant functional elements of the special allocation layer 330 of the storage operating system 235, according to one embodiment. It is noted that the special allocation layer may operate on any data that is pre-allocated on storage devices 120 and/or 215 of storage server 100. However, to facilitate description herein, it is assumed that the special allocation layer operates on data that is pre-allocated on storage devices 120. The special allocation layer 330 (shown in FIG. 4) includes a special allocation component 400. The special allocation component 400 examines the contents of data blocks 405 included in NVLOG 250 to determine whether the contents correspond to data that has been previously identified as special data, such as specially allocated data 410 included in the SAD data structure 245. For example, the special allocation component 400 may compare the contents of a data block 405 to the specially allocated (e.g., zero-filled) data 410. As another example, the special allocation component 400 may compute a hash of a data block 405 and compare the hash to a hash of the specially allocated data 410. When the contents of a data block 405 correspond to specially allocated data 410, the special allocation component 400 removes the data block 405 from the buffer cache 240 and assigns a special pointer to the corresponding level 1 block pointer to signify that the data block 405 contains specially allocated data 410 that is pre-allocated on disk 120.
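The two matching strategies mentioned above, direct comparison and hash comparison, might be sketched in Python as follows. The numeric value of `VBN_HEADER`, the example header bytes, the choice of SHA-256, and the function names are assumptions made for illustration only.

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 4096  # assumed block size
VBN_ZERO = 0
VBN_HEADER = 1     # hypothetical value for a second special pointer

# Stand-in for the SAD data structure: special VBN -> single in-memory instance.
SAD = {
    VBN_ZERO: bytes(BLOCK_SIZE),
    VBN_HEADER: b"%DOC-HEADER".ljust(BLOCK_SIZE, b"\0"),  # made-up header block
}
# Hashes of the special blocks, computed once in advance.
SAD_HASHES = {hashlib.sha256(data).digest(): vbn for vbn, data in SAD.items()}

def match_by_compare(block: bytes) -> Optional[int]:
    """Compare the block's contents directly against each special block."""
    for vbn, special in SAD.items():
        if block == special:
            return vbn
    return None

def match_by_hash(block: bytes) -> Optional[int]:
    """Compare a hash of the block against the hashes of the special blocks."""
    return SAD_HASHES.get(hashlib.sha256(block).digest())
```

Direct comparison is simple when only a few special blocks exist, while the hash lookup in this sketch costs one hash computation per block regardless of how many special blocks are defined.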

In some embodiments, the special allocation component 400 processes the NVLOG 250 prior to the data blocks being written to disk 120. This may occur, for example, during an event called a “consistency point”, in which the storage server 100 stores new or modified data to its mass storage devices 120 based on the write requests temporarily stored in the buffer cache 240. However, it is noted that, in some embodiments, a consistency point may begin before the special allocation component 400 finishes examining all of the data blocks 405 of NVLOG 250. In such cases, the data blocks 405 that have not yet been examined are examined during the consistency point and before the consistency point completes. That is, each unexamined data block is examined before being flushed to disk.

In some embodiments, the special allocation component 400 is embodied as one or more software modules within the special allocation layer 330 of the storage operating system 235. In other embodiments, however, the functionality provided by the special allocation component is implemented, at least in part, by one or more dedicated hardware circuits. The special allocation component 400 may be stored or distributed on, for example, computer-readable media, including magnetically or optically readable computer discs, hard-wired or preprogrammed chips (e.g., EEPROM semiconductor chips), nanotechnology memory, or other computer-readable storage medium. Indeed, computer implemented instructions, data structures, screen displays, and other data under aspects of the technology described herein may be distributed over the Internet or over other networks (including wireless networks), on a propagated signal on a propagation medium (e.g., an electromagnetic wave(s), etc.) over a period of time, or they may be provided on any analog or digital network (packet switched, circuit switched, or other scheme).

Returning to FIG. 3, in some embodiments, the storage manager 300 cooperates with virtualization layers (e.g., vdisk layer 335 and translation layer 340) to “virtualize” the storage space provided by storage devices 120. That is, the storage manager 300, together with the vdisk layer 335 and the translation layer 340, aggregates the storage devices 120 of storage subsystem 110 into a pool of blocks that can be dynamically allocated to form a virtual disk (vdisk). A vdisk is a special file type in a volume that has associated export controls and operation restrictions that support emulation of a disk. The vdisk includes a special file inode that functions as a container for storing metadata associated with the emulated disk. It should be noted that the storage manager 300, the vdisk layer 335, and translation layer 340 can be implemented in software, hardware, firmware, or a combination thereof.

The vdisk layer 335 is layered on the storage manager 300 to enable access by administrative interfaces, such as user interface (UI) 345, in response to a user (e.g., a system administrator) issuing commands to the storage server 100. The vdisk layer 335 implements a set of vdisk (LUN) commands issued through the UI 345 by a user. These vdisk commands are converted to file system operations that interact with the storage manager 300 and the translation layer 340 to implement the vdisks.

The translation layer 340, in turn, initiates emulation of a disk or LUN by providing a mapping procedure that translates LUNs into the special vdisk file types. The translation layer 340 is logically “between” the storage access layer 315 and the storage manager 300 to provide translation between the block (LUN) space and the file system space, where LUNs are represented as blocks. In some embodiments, the translation layer 340 provides a set of application programming interfaces (APIs) that are based on the SCSI protocol and that enable a consistent interface to both the iSCSI and FCP drivers. In operation, the translation layer 340 may transpose a SCSI request into a message representing an operation directed to the storage manager 300. A message generated by the translation layer 340 may include, for example, a type of operation (e.g., read, write) along with a pathname and a filename of the vdisk object represented in the file system. The translation layer 340 passes the message to the storage manager 300 where the operation is performed.

In some embodiments, the storage manager 300 implements so called “flexible” volumes (hereinafter referred to as virtual volumes (VVOLs)), where the file system layout flexibly allocates an underlying physical volume into one or more VVOLs. FIG. 5 illustrates an aggregate 500, in one embodiment. As used herein, the underlying physical volume for a plurality of VVOLs 505 is an aggregate 500 of one or more groups of disks of a storage system.

As illustrated in FIG. 5, each VVOL 505 can include named logical unit numbers (LUNs) 510, directories 515, qtrees 520, and files 525. A qtree 520 is a special type of directory that acts as a “soft” partition, i.e., the storage used by the qtrees is not limited by space boundaries. The aggregate 500 is layered on top of the translation layer 340, which is represented by at least one RAID plex 530, wherein each plex 530 includes at least one RAID group 535. Each RAID group further includes a plurality of disks 540, e.g., one or more data (D) disks and at least one parity (P) disk.

Whereas the aggregate 500 is analogous to a physical volume of a conventional storage system, a VVOL 505 is analogous to a file within that physical volume. That is, the aggregate 500 may include one or more files, wherein each file contains a VVOL 505 and wherein the sum of the storage space consumed by flexible volumes associated with the aggregate 500 is physically less than or equal to the size of the overall physical volume. The aggregate 500 utilizes a physical volume block number (PVBN) space that defines the storage space of blocks provided by the disks of the physical volume, while each VVOL embedded within a file utilizes a “logical” or “virtual” volume block number (VVBN) space in order to organize those blocks as files. The PVBNs reference locations on disks of the aggregate 500, whereas the VVBNs reference locations within files of the VVOL 505. Each VVBN space is an independent set of numbers that corresponds to locations within the file that may be translated to disk block numbers (DBNs) on disks 120. Since a VVOL 505 is also a logical volume, it has its own block allocation structures (e.g., active, space and summary maps) in its VVBN space.

Each VVOL 505 may be a separate file system that is “mingled” onto a common set of storage in the aggregate 500 by an associated storage operating system 235. In some embodiments, the translation layer 340 of the storage operating system 235 builds a RAID topology structure for the aggregate 500 that guides each file system when performing write allocations. The translation layer 340 also presents a PVBN to disk block number (DBN) mapping to the storage manager.

A container file may be associated with each VVOL 505. As used herein, a container file is a file in the aggregate 500 that contains all blocks of the VVOL 505. In some embodiments, the aggregate 500 includes one container file per VVOL 505. FIG. 6 is a block diagram of a container file 600 for a VVOL 505. The container file 600 has an inode 605 of the flexible volume type that is assigned an inode number equal to a virtual volume id (VVID). The container file 600 is typically one large, sparse virtual disk and, since it contains all blocks owned by its VVOL, a block with VVBN of “x” in the VVOL 505 can be found at the file block number (FBN) of “x” in the container file 600. For example, VVBN 2000 in the VVOL 505 can be found at FBN 2000 in its container file 600. Since each VVOL 505 in the aggregate 500 has its own distinct VVBN space, another container file may have FBN 2000 that is different from FBN 2000 in the container file 600. The inode 605 references indirect blocks 610, which, in turn, reference both physical data blocks 615 and virtual data blocks 620 at level 0.
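
By way of illustration only, the identity between a VVBN in a VVOL and the FBN in its container file, and the independence of the VVBN spaces of different VVOLs, may be sketched as follows. The class ContainerFile, its methods, and the use of a dictionary to model a sparse file are hypothetical and are not taken from the described embodiments.

```python
from typing import Dict, Optional


class ContainerFile:
    """Sparse container file for one VVOL: only written FBNs occupy an entry."""

    def __init__(self, vvid: int) -> None:
        self.vvid = vvid  # the inode number equals the virtual volume id (VVID)
        self._blocks: Dict[int, bytes] = {}  # FBN -> block contents

    def write_vvbn(self, vvbn: int, data: bytes) -> None:
        # A block with VVBN x in the VVOL is stored at FBN x in the container file.
        self._blocks[vvbn] = data

    def read_vvbn(self, vvbn: int) -> Optional[bytes]:
        # An FBN that was never written is a hole; None marks a hole in this sketch.
        return self._blocks.get(vvbn)


# Two VVOLs in the same aggregate, each with its own distinct VVBN space.
vvol_a = ContainerFile(vvid=1)
vvol_b = ContainerFile(vvid=2)
vvol_a.write_vvbn(2000, b"block of VVOL A")
vvol_b.write_vvbn(2000, b"block of VVOL B")
```

In this sketch, FBN 2000 of one container file and FBN 2000 of the other hold different blocks, mirroring the example given above.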

FIG. 7 is a block diagram of a buffer tree of a file 700 within the container file 600. The file 700 is assigned an inode 705, which references indirect blocks 710. The buffer tree is an internal representation of blocks of file 700 loaded into the buffer cache 240 maintained by the storage operating system 235. It is noted that there may be additional levels of indirect blocks 710 (e.g., level 2, level 3) depending upon the size of the file 700. In a file within a flexible volume, an indirect block 710 stores references to both the physical VBN (PVBN) and the virtual VBN (VVBN). The PVBN references a physical block 715 in the aggregate 500 and the VVBN references a logical block 720 in the VVOL 505. FIG. 7 shows the indirect blocks 710 referencing both physical data blocks 715 and virtual data blocks 720 at level 0. In some embodiments, the special allocation layer 330 supports VVOLs. Thus, in embodiments where VVOLs are supported, the special allocation layer 330 may define one or more special VVBN/PVBN pointer pairs that are used to indicate that the level 0 data blocks associated with the VVBN/PVBN pairs are specially allocated and pre-allocated on disk. In operation, when the contents of a data block correspond to data that has been previously identified as special data, a special VVBN/PVBN pair is assigned to the corresponding level 1 block pointer and the data block is removed from the buffer cache 240 so that it is not flushed to disk. Subsequently, when the block is requested, the storage manager 300 identifies the special VVBN/PVBN pair and reads the corresponding specially allocated data block from memory 245 (not disk 120). By accessing special data in memory 205 of the storage server 100, requests for such data can be responded to substantially faster because the special data may be read without accessing the disk and/or the storage subsystem 110. Moreover, by not issuing write requests to store special data on disk, the technology introduced herein avoids disk fragmentation caused by freeing duplicate data blocks. Also, by not issuing write requests to store special data on disk, the technology introduced herein substantially reduces the processing time and overhead associated with deduplicating duplicate data blocks. In addition, by not issuing requests to access special data exceeding a pre-defined sharing threshold on disk, the technology introduced herein eliminates hot spots associated with reading a single instance of a deduplicated data block.
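
By way of illustration only, the use of a special VVBN/PVBN pointer pair may be sketched as follows. The reserved pair value, the 4096-byte block size, and the names PointerPair, ZERO_PAIR, SPECIAL_PAIRS, and read_level0 are hypothetical; the described embodiments do not specify which pointer values are reserved.

```python
from typing import Callable, Dict, NamedTuple

BLOCK_SIZE = 4096  # assumed block size in bytes


class PointerPair(NamedTuple):
    """A level 1 block pointer holding both a VVBN and a PVBN."""
    vvbn: int
    pvbn: int


# Hypothetical reserved pair signifying a zero-filled, specially allocated block.
ZERO_PAIR = PointerPair(vvbn=0xFFFFFFFF, pvbn=0xFFFFFFFF)

# In-memory copies of the specially allocated data, keyed by special pair.
SPECIAL_PAIRS: Dict[PointerPair, bytes] = {ZERO_PAIR: bytes(BLOCK_SIZE)}


def read_level0(pair: PointerPair, read_pvbn_from_disk: Callable[[int], bytes]) -> bytes:
    """Return the level 0 block for a pointer pair, avoiding disk for special pairs."""
    special = SPECIAL_PAIRS.get(pair)
    if special is not None:
        return special  # served from memory; no disk request is issued
    return read_pvbn_from_disk(pair.pvbn)
```

A caller supplies the disk read routine, so the sketch makes visible that a special pair never reaches that routine.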

FIG. 8 is a flow chart of a process 800 for specially allocating data blocks prior to the data blocks being written to disk. In some embodiments, the process 800 is performed by special allocation component 400. To facilitate description, it is assumed that the storage server 100 receives a request from a client 130 to write a 10 gigabyte (GB) virtual disk image file to disk (e.g., such a request may be received as a result of a user creating a virtual machine). To facilitate description, it is further assumed that the storage manager 300 operates on data arranged in 4 kilobyte (KB) blocks. Thus, the virtual disk image file received by the storage server 100 is converted from 10 GB to 2,621,440 4-KB blocks. In addition, to facilitate description, it is assumed that a 4 kilobyte (KB), zero-filled data block 410 has been specially allocated by the storage server 100 in memory 245. It is noted that the process 800 may be used to specially allocate other types of data. For example, the process 800 may be employed to specially allocate data blocks whose contents correspond to a particular document header type. As such, the special allocation of zero-filled blocks should not be taken as restrictive.

In some embodiments, the process 800 is invoked in response to an entry associated with a write request being added to the NVLOG 250. In other embodiments, the process 800 is invoked after a pre-defined number of entries are added to the NVLOG 250. In yet other embodiments, the process 800 is invoked as a precursor to, or as part of, a consistency point.

Initially, at step 805, the special allocation component 400 selects a data block 405 (FIG. 4) from the NVLOG 250. For example, the selected data block may be one of the 2,621,440 4-KB zero-filled data blocks of the virtual disk image file.

Next, at step 810, the special allocation component 400 determines whether the contents of the selected data block 405 correspond to specially allocated data 410. For example, the special allocation component 400 may compare the contents of the selected block 405 to the contents of a specially allocated, zero-filled block 410. As another example, the special allocation component 400 may compute a hash of the selected block 405 and compare the hash to a hash of the specially allocated, zero-filled block 410. If the contents of the selected data block 405 do not correspond to specially allocated data 410, the process continues at step 825, as described below. Otherwise, if the contents of the selected data block 405 correspond to specially allocated data 410 (e.g., for zero-filled special data, if the hash is zero), the process proceeds to step 815.

At step 815, the special allocation component 400 assigns a special pointer (e.g., VBN_ZERO) to the corresponding level 1 indirect block of the file to signify that the contents of the data block 405 correspond to specially allocated data (e.g., are zero-filled) and have been pre-allocated on disk. Then the process proceeds to step 820.

At step 820, the special allocation component 400 removes the selected block 405 from the buffer cache 240 so that the block 405 is not written to disk 120. Then the process proceeds to step 825.

At step 825, the special allocation component 400 determines whether all of the blocks 405 in the NVLOG 250 have been selected for processing. If any block 405 remains, the process continues at step 805 where the special allocation component 400 selects a block 405, as described above. Otherwise, if all of the blocks 405 have been selected, the process ends.
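
By way of illustration only, steps 805 through 825 may be sketched as the following loop. The data layout (an NVLOG modeled as a list of (FBN, data) entries, and the buffer cache and level 1 pointers modeled as dictionaries keyed by FBN), the value chosen for VBN_ZERO, and the function name specially_allocate are assumptions made for this sketch. A direct content comparison against a zero-filled block is used at step 810; a hash comparison could be substituted.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096                # assumed block size in bytes
VBN_ZERO = -1                    # hypothetical reserved value for the special pointer
ZERO_BLOCK = bytes(BLOCK_SIZE)   # stand-in for the specially allocated data 410


def specially_allocate(
    nvlog: List[Tuple[int, bytes]],
    buffer_cache: Dict[int, bytes],
    level1_pointers: Dict[int, int],
) -> int:
    """Process every logged block once; return how many were specially allocated."""
    count = 0
    for fbn, data in nvlog:                  # steps 805 and 825: select each block in turn
        if data == ZERO_BLOCK:               # step 810: contents match the special data?
            level1_pointers[fbn] = VBN_ZERO  # step 815: assign the special pointer
            buffer_cache.pop(fbn, None)      # step 820: block will not be flushed to disk
            count += 1
    return count
```

For instance, with three logged blocks of which two are zero-filled, only the one remaining block stays in the buffer cache to be written at the next consistency point.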

Those skilled in the art will appreciate that the steps shown in FIG. 8 and in each of the following flow diagrams may be altered in a variety of ways. For example, the order of certain steps may be rearranged; certain substeps may be performed in parallel; certain shown steps may be omitted; or other steps may be included; etc.

FIG. 9 is a flow chart of a process 900 for servicing a read request, in one embodiment. In some embodiments, the process 900 is performed by the storage manager 300. However, it is noted that the process 900 may be performed by another component of the storage server and/or another computing device. As such, references to the storage manager 300 should not be taken as restrictive.

In some embodiments, the process 900 is invoked in response to the storage server 100 receiving a read request from a client 130. Initially, the process begins at step 905, when the storage manager 300 receives a read request, such as a file request, a block request, and so on. Next, at step 910, the storage manager 300 processes the request by, for example, converting the request to a set of file system operations. Then, the process proceeds to step 915. At step 915, the storage manager 300 identifies the data blocks to load. This may be accomplished, for example, by identifying the inode corresponding to the request. Then, the process proceeds to step 920.

At step 920, for each identified block, a determination is made as to whether the block is specially allocated. This determination may be made, for example, by examining the corresponding level 1 block pointer referencing the data block to determine whether it is a predetermined special pointer. For each specially allocated block, the process proceeds to step 925. Otherwise, the process continues at step 930, as described below.

At step 925, for each block that is specially allocated, the storage manager 300 reads the block 410 from the SAD data structure 245. Then, the process continues at step 945, as described below.

At step 930, for each block that is not specially allocated, the storage manager 300 determines whether the block is stored in the buffer cache 240. For each block that is stored in the buffer cache 240, the process proceeds to step 935. Otherwise, the process continues at step 940, as described below.

At step 935, for each block that is stored within the buffer cache 240, the storage manager 300 reads the block from the buffer cache 240. Then, the process continues at step 945, as described below.

At step 940, for each block that is not specially allocated or stored within the buffer cache 240, the storage manager 300 retrieves the block from disk 120. Then, the process proceeds to step 945.

At step 945, the storage manager 300 determines whether there are more blocks to load. If there are more data blocks to load, the process proceeds to step 950. At step 950, the storage manager 300 selects the next block. Then the process proceeds to step 920, as described above. Otherwise, if there are no more blocks to load at step 945, the process proceeds to step 955. At step 955, the storage server 100 returns the requested data blocks to the client 130. Then the process ends.
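
By way of illustration only, steps 920 through 955 may be sketched as follows. The dictionaries standing in for the SAD data structure, the buffer cache, and the disk, the value chosen for VBN_ZERO, and the function name service_read are assumptions made for this sketch. The count of disk reads is returned only to make visible which blocks required a disk access; loading a block read from disk into the buffer cache follows the read path described earlier.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # assumed block size in bytes
VBN_ZERO = -1      # hypothetical reserved value for the special pointer


def service_read(
    fbns: List[int],
    level1_pointers: Dict[int, int],
    sad: Dict[int, bytes],
    buffer_cache: Dict[int, bytes],
    disk: Dict[int, bytes],
) -> Tuple[List[bytes], int]:
    """Return the requested blocks in order, plus the number of disk reads issued."""
    blocks: List[bytes] = []
    disk_reads = 0
    for fbn in fbns:                      # steps 945 and 950: iterate over the blocks
        vbn = level1_pointers[fbn]
        if vbn in sad:                    # step 920: special pointer?
            data = sad[vbn]               # step 925: read from memory
        elif vbn in buffer_cache:         # step 930: present in the buffer cache?
            data = buffer_cache[vbn]      # step 935: read from the buffer cache
        else:
            data = disk[vbn]              # step 940: retrieve from disk
            buffer_cache[vbn] = data      # loaded into the buffer cache for later reads
            disk_reads += 1
        blocks.append(data)
    return blocks, disk_reads             # step 955: blocks are returned to the client
```

In this sketch a file consisting mostly of zero-filled blocks is served almost entirely from memory, regardless of how many times each special block is referenced.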

Thus, a system and method for specially allocating data has been described. Note that references in this specification to “an embodiment”, “one embodiment”, “some embodiments”, or the like, mean that the particular feature, structure or characteristic being described is included in at least one embodiment of the present invention. Occurrences of such phrases in this specification do not necessarily all refer to the same embodiment. Although the technology introduced herein has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense.

1-17. (canceled)
18. A method in a computing system for specially allocating data within the computing system, the method comprising: receiving, by an allocation component executing on a processor of the computing system, a request from a client to write a set of data blocks in a non-volatile mass storage facility of the computing system, the set of data blocks including a data block that includes data previously designated as special data, wherein the special data is pre-allocated in the non-volatile mass storage facility; for each data block in the set of data blocks, comparing, by the allocation component executing on the processor of the computing system, the data block to the special data, when the data block is determined to match the special data, assigning a special block pointer to the data block, the special block pointer pointing to a block address in the non-volatile mass storage facility, the special block pointer indicating that the corresponding data block was pre-allocated in both the non-volatile mass storage facility and a volatile memory of the computing system; and preventing the data block from being written to the non-volatile mass storage facility as a result of the received request.
19. The method of claim 18, further comprising: in response to a request to read a data block having a block pointer from the non-volatile mass storage facility, determining whether the block pointer matches the special block pointer; and when the block pointer matches the special block pointer, reading the contents of the pre-allocated data block from the volatile memory of the storage system without issuing a request to read the contents of the pre-allocated data block from the non-volatile mass storage facility.
20. The method of claim 18, further comprising: when the data is determined not to match the special data, storing the data block in the non-volatile mass storage facility of the computing system.
21. The method of claim 18, wherein the special data is a header file.
22. The method of claim 18, wherein the special data is a zero-filled block of data.
23. The method of claim 22, wherein the received request is associated with the client creating a virtual disk image file of a virtual machine.
24. The method of claim 18, wherein comparing the data to the special data includes computing a hash of the data and comparing the hash to a hash of the special data.
25. The method of claim 18, wherein preventing the data block from being written to the non-volatile mass storage facility includes removing the data block from a buffer cache of the computing system, wherein the buffer cache is used by the computing system to store data blocks that are to be written to the non-volatile mass storage facility during a consistency point.
26. The method of claim 25, wherein the method is performed prior to the consistency point.
27. The method of claim 25, wherein the method is performed during the consistency point and prior to completion of the consistency point.
28. The method of claim 18, wherein the special data has been designated as special by virtue of having satisfied a special block sharing criterion.
29. A storage system comprising: a processor; a network communication interface to provide the storage system with data communication with at least one client over a network; a storage interface to provide the storage system with data communication with one or more mass storage devices, wherein the one or more mass storage devices includes a pre-allocated data block of special data; and a memory including contents of the pre-allocated data block and special allocation code that, when executed by the processor, causes the storage system to execute a special allocation process in response to a request received from the at least one client to read a portion of data from the one or more mass storage devices, the portion of data having a block pointer identifying a location of the one or more mass storage devices on which the portion of data is stored; comparing the block pointer to the special block pointer to determine whether the portion of data matches the contents of the pre-allocated data block; and when the block pointer matches the special block pointer, reading the contents of the pre-allocated data block from the memory of the storage system without issuing a request to read the contents of the pre-allocated data block from the one or more mass storage devices.
30. The storage system of claim 29, wherein the special allocations process further includes: in response to a request received from the at least one client to write data to the one or more mass storage devices, the data including at least one portion of data that matches the contents of the pre-allocated data block, for each portion of data, comparing the portion of data to the contents of the pre-allocated data block; when the portion of data is determined to match the contents of the pre-allocated data block, assigning a special block pointer to the portion of data, the special block pointer pointing to a block address in a non-volatile mass storage device; and when the portion of data is determined not to match the contents of the pre-allocated data block, issuing a request to allocate storage in the one or more mass storage devices to which the portion of data is to be written.
31. The storage system of claim 30, wherein the storage system further includes a buffer cache to store data that is to be written to the one or more mass storage devices during a consistency point event, and wherein, when the portion of data is determined to match the contents of the pre-allocated data block, the portion of data is removed from the buffer cache prior to being written to the one or more mass storage devices.
32. The storage system of claim 31, wherein the process is performed prior to the consistency point event.
33. The storage system of claim 31, wherein the process is performed during the consistency point event and prior to completion of the consistency point.
34. The storage system of claim 30, wherein the storage system further comprises a deduplication component, and wherein the process substantially reduces redundant allocation and deallocation of the pre-allocated data block in the one or more mass storage device of the storage system.
35. A non-transitory computer-readable storage medium for specially allocating data within a computing system, the storage medium comprising: instructions for receiving a request from a client to write a set of data blocks in a non-volatile mass storage facility of the computing system, the set of data blocks including a data block that includes data previously designated as special data, wherein the special data is pre-allocated in the non-volatile mass storage facility; instruction for comparing the data block to the special data for each data block in the set of data blocks; instruction for, when the data block is determined to match the special data, assigning a special block pointer to the data block, the special block pointer pointing to a block address in the non-volatile mass storage facility, the special block pointer indicating that the corresponding data block was pre-allocated in both the non-volatile mass storage facility and a volatile memory of the computing system; instruction for preventing the data block from being written to the non-volatile mass storage facility as a result of the received request.
36. The storage medium of claim 35, further comprising: instructions for, in response to a request to read a data block having a block pointer from the non-volatile mass storage facility, determining whether the block pointer matches the special block pointer; and instructions for, when the block pointer matches the special block pointer, reading the contents of the pre-allocated data block from the volatile memory of the storage system without issuing a request to read the contents of the pre-allocated data block from the non-volatile mass storage facility.
37. The storage medium of claim 35, further comprising: instructions for, when the data is determined not to match the special data, storing the data block in the non-volatile mass storage facility of the computing system.